LONG-TERM GOAL

Our  long-term  goal  is  to  investigate  the  potential  of  multicomputers  constructed  of  commercial,  off-the- 
shelf  components  as  a  replacement  for  conventional  parallel  processors  in  high-performance  computing. 
By  commercial,  off-the-shelf  components,  we  mean  commodity  workstations  and  personal  computers 
connected  by  readily  available  networking  fabric,  e.g.  DEC  Alpha  and  Intel  PC  boxes  connected  by 
gigabit  ethernet.  To  understand  the  impact  of  clusters  in  this  area,  we  are  developing  several  layers  of 
software  to  support  traditional  supercomputing  applications  such  as  climate  and  ocean  modeling. 

OBJECTIVES 

We  hope  to  demonstrate  the  efficacy  of  constructing  medium-scale,  clustered  multicomputers  for  a 
fraction  of  the  cost  of  traditional  supercomputers.  We  are  building  a  clustered  system  called  Centurion. 
Using  Centurion,  we  will  demonstrate  that  we  can  outperform  a  cluster  of  C90  machines,  and  at  a 
fraction  of  the  cost.  This  research  can  point  the  way  to  developing  more  effective,  lower-cost 
computational  engines  for  the  Department  of  Defense. 

APPROACH 

While the original proposal was to build a 50-Gigaflop machine, the resulting machine will have a peak
rate of more than 200 Gigaflops. We are building this multicomputer from off-the-shelf commodity
machines based on both DEC Alpha and Intel processors. One third of the machine is
connected in a 2D torus using 1.28 Gb/s Myrinet switches; all processors are interconnected by Fast
Ethernet switches and Gigabit Ethernet hubs. This machine will be used to solve computationally
challenging  science  and  engineering  problems  and  as  a  testbed  for  systems-oriented  computer  science 
research.  The  heterogeneous  nature  of  the  machine,  both  in  terms  of  processor  architecture  and 
networking  fabric,  gives  us  a  rich  environment  for  determining  the  best  operational  aspects  of  our 
application  suite. 
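
To make the 2D torus interconnect concrete, the following minimal sketch computes the four neighbors of a node when every row and column wraps around. The grid dimensions and the row-major node numbering are assumptions made for illustration; this report does not state the torus dimensions or how nodes are numbered, and the function name is hypothetical.

```python
def torus_neighbors(node, rows, cols):
    """Return the four neighbors of `node` in a rows x cols 2D torus.

    Assumption: nodes are numbered row-major starting at 0, and every
    row and column wraps around (that wraparound is what makes it a torus).
    """
    if not 0 <= node < rows * cols:
        raise ValueError("node index out of range")
    r, c = divmod(node, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,
        "south": ((r + 1) % rows) * cols + c,
        "west": r * cols + (c - 1) % cols,
        "east": r * cols + (c + 1) % cols,
    }


if __name__ == "__main__":
    # Hypothetical 4 x 4 torus; corner node 0 wraps to the far row and column.
    print(torus_neighbors(0, 4, 4))
```

Because of the wraparound links, a corner node has the same number of neighbors as an interior node, so a nearest-neighbor exchange in a grid-decomposed code can treat all nodes uniformly.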

Our  application  suite  consists  of  two  ocean  simulations,  a  directed  vapor  deposition  code,  a  DNA 
sequence  comparison  code,  a  polyatomic  system  simulator,  a  biomolecular  simulation,  a 
macromolecular  dynamics  and  mechanics  code,  and  a  system  for  performing  molecular  orbital 
calculations.  For  details,  see  http://legion.virginia.edu/centurion/Applications.html. 
WORK COMPLETED


We have begun acquisition of the required equipment, having issued requests for bids on 66 Alpha
machines running at 533 MHz and 66 dual-processor Intel Pentium II machines running at 350 MHz.
We have received approximately three dozen responses to these requests and are reconciling the bids
with the requirements to determine which vendor will be awarded the contract. We are also in the
process  of  acquiring  the  necessary  Ethernet  switches  to  interconnect  the  new  machines  and  join  them  to 
our  existing  cluster. 

In  the  meantime,  we  have  ported  the  Navy  Layered  Ocean  Model  (NLOM)  to  our  existing  cluster  of  64 
DEC  Alpha  machines,  and  have  obtained  the  Miami  Isopycnic  Coordinate  Ocean  Model  (MICOM) 
software  for  porting  to  Centurion.  We  have  also  ported  an  axially-symmetric  direct  simulation  Monte 
Carlo  code  to  the  machine. 

RESULTS 

It  is  too  early  in  the  acquisition  cycle  for  Centurion  for  us  to  have  meaningful  results  on  the  full  cluster. 
Our  1999  Annual  Report  will  contain  analysis  of  the  software  performance  achieved  on  the  new  cluster. 

We can report results from the NLOM code on the existing Centurion cluster. We started
with an unoptimized version of the code and benchmarked it on 4 processors within Centurion. The
following table lists the optimizations made and the performance reached after each one.


Description                                             Performance

Out of the box                                          ~70 Mflops on 4 nodes
Cache block 2 main loops, split communication           ~175 Mflops
Inter-procedural analysis, eliminate 2 matrix copies    ~195 Mflops
Eliminate communication associated with extra copies    ~235 Mflops
Loop jam                                                ~220 Mflops
Cache block jammed loop                                 ~355 Mflops
Fix Legion-MPI bug                                      ~480 Mflops
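
The per-step and cumulative gains implied by the table can be worked out directly. The sketch below copies the approximate rates from the table, reading the mark in front of each figure as "approximately"; the variable and function names are illustrative only.

```python
# Approximate NLOM rates on 4 nodes, in Mflops, copied from the table.
steps = [
    ("Out of the box", 70),
    ("Cache block 2 main loops, split communication", 175),
    ("Inter-procedural analysis, eliminate 2 matrix copies", 195),
    ("Eliminate communication associated with extra copies", 235),
    ("Loop jam", 220),
    ("Cache block jammed loop", 355),
    ("Fix Legion-MPI bug", 480),
]


def speedups(steps):
    """Return (name, gain over previous step, gain over baseline) per row."""
    baseline = steps[0][1]
    previous = baseline
    rows = []
    for name, rate in steps:
        rows.append((name, rate / previous, rate / baseline))
        previous = rate
    return rows


if __name__ == "__main__":
    for name, step_gain, total_gain in speedups(steps):
        print(f"{name:<55} x{step_gain:4.2f} step  x{total_gain:4.2f} total")
```

On these figures the full sequence of optimizations gives roughly a 6.9x improvement over the out-of-the-box code. The loop jam step on its own lowers the rate slightly (about 0.94x) and pays off only once the jammed loop is cache blocked (about 1.61x over the previous step).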

In addition, we ran the ocean model on 49 processors. The results of this run are depicted in the
following graph. Note that we achieved 3.7 Gflops with $150,000 worth of equipment.


[Graph: Performance versus number of processors. Series: Centurion; SGI O2K (2 GB); Sun E10K (2 GB);
single-processor T90.]
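
The price/performance figure quoted for the ocean model run can be expressed per Gflops and per processor. This sketch assumes that the 3.7 Gflops and the $150,000 both refer to the 49-processor run described above; the function name and dictionary keys are illustrative.

```python
def price_performance(gflops, cost_dollars, processors):
    """Return cost per Gflops and sustained Mflops per processor."""
    return {
        "dollars_per_gflops": cost_dollars / gflops,
        "mflops_per_processor": gflops * 1000 / processors,
    }


if __name__ == "__main__":
    # Figures reported for the ocean model run (assumed to describe the same run).
    result = price_performance(gflops=3.7, cost_dollars=150_000, processors=49)
    print(f"${result['dollars_per_gflops']:,.0f} per Gflops")
    print(f"{result['mflops_per_processor']:.1f} Mflops per processor")
```

With the reported figures this works out to roughly $40,500 per sustained Gflops and about 75 Mflops per processor.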


We  also  measured  performance  for  an  axially-symmetric  steady  flow  materials  simulation.  This 
program  simulates  the  flow  of  a  gas  jet  during  chip  manufacture;  the  intent  is  to  improve  fabrication 
yield.  The  original  code  took  more  than  one  week  to  execute  on  a  single  RS/6000.  On  52  nodes  of  the 
Centurion  cluster,  it  took  41  minutes  after  parallelization.  We  did  observe  superlinear  speedup  because 
of  cache  effects.  The  speedup  results  appear  in  the  following  table  and  graph;  full  details  are  in  [1]. 
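
As a rough check on the superlinear claim, the reported run times can be turned into a speedup and a per-node efficiency. The sketch below assumes the serial run took exactly one week (the report says "more than one week", so the true ratio is larger) and treats the RS/6000 time as the serial baseline even though it is a different machine type from the cluster nodes. The function name is illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def speedup_and_efficiency(serial_minutes, parallel_minutes, nodes):
    """Return (speedup, efficiency); efficiency above 1.0 means superlinear."""
    speedup = serial_minutes / parallel_minutes
    return speedup, speedup / nodes


if __name__ == "__main__":
    # Assumption: one week is used as a lower bound on the serial run time.
    speedup, efficiency = speedup_and_efficiency(MINUTES_PER_WEEK, 41, 52)
    print(f"speedup >= {speedup:.0f}, efficiency >= {efficiency:.2f}")
```

Under these assumptions the run-time ratio is roughly 246 on 52 nodes, well above 52, which is consistent with the superlinear speedup noted above. Part of that ratio may reflect the difference between the RS/6000 and the cluster processors and not only cache effects.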
IMPACT/APPLICATION

This DURIP award will more than triple the available computing power in our current Centurion
cluster. This will allow us to make real-world computational and price/performance comparisons
between our cluster, obtained for a total investment of less than $1,000,000, and the Cray
machines at the NAVO MSRC in Stennis, MS on the same applications.

TRANSITIONS 

It  is  too  early  in  the  project  for  anyone  to  have  adopted  our  methods.  When  we  have  installed  the  full 
system  and  performed  experimental  analysis,  we  will  present  our  results  to  the  various  DoD  MSRCs 
and  Research  Labs.  We  hope  that,  as  a  result  of  this  work,  cluster  computing  will  become  a  mainstream 
computing  resource  for  the  DoD. 

RELATED  PROJECTS 

1 - We are developing the Legion run-time system for use on the Centurion cluster. Legion will
provide automated binary management, an object-oriented computing paradigm (but also supporting
legacy code in Fortran), security, resource management, and multi-language/environment support,
including MPI, PVM, Fortran, C++, and Java.

2  -  Under  the  HOSS  project,  I  am  developing  operating  system  mechanisms  to  support  high- 
performance  computing  on  clusters.  My  current  focus  is  on  flexible  address-space  (distributed  shared 
memory) management, in collaboration with Mike Heroux of Sandia National Labs and David Bader of
the  University  of  New  Mexico. 